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Abstract 

In this paper, we first propose a Bayesian neighborhood seleetion method to estimate 
Gaussian Graphieal Models (GGMs). We show the graph seleetion eonsisteney of this 
method in the sense that the posterior probability of the true model eonverges to one. When 
there are multiple groups of data available, instead of estimating the networks independently 
for eaeh group, joint estimation of the networks may utilize the shared information among 
groups and lead to improved estimation for eaeh individual network. Our method is extended 
to jointly estimate GGMs in multiple groups of data with eomplex struetures, ineluding spa¬ 
tial data, temporal data and data with both spatial and temporal struetures. Markov random 
field (MRF) models are used fo effieienlly ineorporafe fhe eomplex dafa sfruefures. We de¬ 
velop and implemenf an effieienl algorifhm for sfalisfieal inferenee fhaf enables parallel eom- 
pufing. Simulafion sfudies suggesf fhaf our approaeh aehieves heifer aeeuraey in nelwork es- 
limalion eompared wilh melhods nol ineorporaling spalial and lemporal dependeneies when 
Ihere are shared sfruefures among fhe nelworks, and fhaf if performs eomparably well olh- 
erwise. Finally, we illuslrale our melhod using fhe human brain gene expression mieroarray 
dalasel, where fhe expression levels of genes are measured in differenl brain regions aeross 
multiple lime periods. 

Keywords: Spatial and Temporal Data; Gaussian Graphical Model; Neighborhood Selection; 
Bayesian Variable Selection; Markov Random Field. 
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1 Introduction 


The analysis of biological networks, including protein-protein interaction networks (PPI), bio¬ 
logical pathways, transcriptional regulatory networks and gene co-expression networks, has led 
to numerous advances in the understanding of the organization and functionality of biological 
systems (e.g., Kanehisa & Goto 2000, Shen-Orr et al. 2002, Rual et al. 2005, Zhang & Horvath 
2005). The work presented in this paper was motivated from the analysis of the human brain 
gene expression microarray data, where the expression levels of genes were measured in numer¬ 
ous spatial loci, which represent different brain regions, during different time periods of brain 
development (Kang et al. 2011). Although these data offer rich information on the network in¬ 
formation among genes, only naive methods have been used for network inference. For example, 
Kang et al. (2011) pooled all the data from different spatial regions and time periods to construct 
a single gene network. However, only a limited number of data points are available for a specific 
region and time period, making region- and time- specific inference challenging. 

Our aim here is to develop sound statistical methods to characterize the changes in the net¬ 
works across time periods and brain regions, as well as the common network edges that are 
shared. This is achieved through a joint modeling framework to infer individual graphs for each 
brain region in each time period, where the degrees of spatial and temporal similarity are learnt 
adaptively from the data. Our proposed joint modeling framework may better capture the edges 
that are shared among graphs, and also allow the graphs to differ across brain regions and time 
periods. 

We represent the biological network with a graph G = {V, E) consisting of vertices V = 
(1, ...,p} and edges E C V x V. In this paper, we focus on conditionally independent graphs, 
where {i,j) ^ if oiily if node i and node j are not conditionally independent given all 
the other nodes. Gaussian graphical models (GGMs) have been proven among the best to infer 
conditionally independent graphs. In GGM, the p-dimensional X = (Xi,..., Xp) is assumed 
to follow a multivariate Gaussian distribution Denote 0 = the precision ma¬ 

trix. It can be shown that the conditional independence of Xi and Xj is equivalent to Qij ^ 0: 
Xi _LL Xj I Xv\{i,j} Qij = 0. In GGM, estimating the conditional independence graph 
is equivalent to estimating the non-zero entries in 0. Various approaches have been proposed to 
estimate the graph (Meinshausen & Biihlmann 2006, Yuan & Lin 2007, Friedman et al. 2008, 
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Cai et al. 2011, Dobra et al. 2011, Wang et al. 2012, Orchard et al. 2013). Among these methods, 
Friedman et al. (2008) developed a fast and simple algorithm, named the graphieal lasso (glasso), 
using a eoordinate deseent proeedure for the lasso. They eonsidered optimizing the penalized 
likelihood, with penalty on the preeision matrix. As extensions of glasso, several approaehes 
have been proposed to jointly estimate GGMs in multiple groups of data. Guo et al. (2011) 
expressed the elements of the preeision matrix for eaeh group as a produet of binary eommon 
faetors and group-speeifie values. They ineorporated an fi penalty on the eommon faetors, to 
eneourage shared sparse strueture, and another penalty on the group-speeifie values, to allow 
edges ineluded in the shared strueture to be set to zero for speeifie groups. Danaher et al. (2014) 
extended glasso more direetly by extending the fi penalty for eaeh preeision matrix with addi¬ 
tional penalty funetions that eneourage shared strueture. They proposed two possible ehoiees of 
penalty funetions: 1. Fused lasso penalty that penalizes the differenee of the preeision matriees, 
whieh eneourages eommon values among the preeision matriees; 2. Group lasso penalty. Chun 
et al. (2014) proposed a elass of non-eonvex penalties for more flexible joint sparsity eonstraints. 
As an alternative to the penalized methods, Peterson et al. (2014) proposed a Bayesian approaeh. 
They formulated the model in the G-Wishart prior framework and modeled the similarity of mul¬ 
tiple graphs through a Markov Random Field (MRF) prior. However, their approaeh is only 
applieable when the graph size is small (~ 20) and the number of groups is also small (~ 5). 

In this paper, we formulate the model in a Bayesian variable seleetion framework (George & 
MeCulloeh 1993, 1997). Meinshausen & Biihlmann (2006) proposed a neighborhood seleetion 
proeedure for estimating GGMs, where the neighborhood of node i was seleeted by regressing 
on all the other nodes. Intuitively, our approaeh is the Bayesian analog of the neighborhood 
seleetion proeedure. Our framework is applieable to the estimation of both single graph and 
multiple graphs. For the joint estimation of multiple graphs, we ineorporate the MRF model. 
Compared with Peterson et al. (2014), we use a different MRF model and a different inferential 
proeedure. One advantage of our approaeh is that it ean naturally model eomplex data struetures, 
sueh as spatial data, temporal data and data with both spatial and temporal struetures. Another 
advantage is the eomputational effleieney. For the estimation of a single graph with 100 nodes 
(the typieal size of biologieal pathways is around that range), the eomputational time on a laptop 
is ~ 30 seeonds for 1, 000 iterations of Gibbs sampling, whieh is ~ 3-folds faster than Bayesian 
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Graphical Lasso, which implements a highly effieient bloek Gibbs sampler and is among the 
fastest algorithms for estimating GGMs in the Bayesian framework (Wang et al. 2012). For 
multiple graphs, the eomputational time inereases roughly linear with the number of graphs. 
Our proeedure also enables parallel eomputing and the eomputational time ean be substantially 
redueed if multieore proeessors are available. For single graph estimation, we show the graph 
seleetion eonsisteney of the proposed method in the sense that the posterior probability of the 
true model eonverges to one. 

The rest of the paper is organized as follows. We introduee the Bayesian neighborhood selee¬ 
tion proeedure for single graph and the extension to multiple graphs in Seetion 2. Model seleetion 
is diseussed in Seetion 3. The theoretieal properties are presented in Seetion 4. The simulation 
results are demonstrated in Seetion 5 and the applieation to the human brain gene expression 
mieroarray dataset is presented in Seetion 6 . We eonelude the paper with a brief summary in 
Seetion 7. 


2 Statistical Model and Methods 


2.1 The Bayesian Neighborhood Selection Procedure 


We first eonsider estimating the graph strueture when there is only one group of data. Consider 
the p-dimensional multivariate normal random variable X ~ A/'(/i, S). We further assume that 
X is centered and X ~ A/'(0, S). Let 0 = denote the preeision matrix. Let the n x p 
matrix X = (Xi,..., Xp) eontain n independent observations of X. For A C { 1 ,..., p}, define 
X^ = (Xp j G A). Let Fj denote the subset of {1,... ,p}, exeluding the ith entry only. For 
any square matrix C, let denote the fih row, exeluding the ith. element in that row. Consider 
estimating the neighborhood of node i. It is well known that the following eonditional distribution 
holds: 


Xj I Xr. -^(-Xr.efr.e-'.e-'/) 


( 1 ) 


where I is the n x n identity matrix, Qa is a sealar and finding the neighborhood of X^ is 
equivalent to estimating the non-zero eoeffieients in the regression of Xi on . Let /3 and 7 
be matriees of dimension p x p, where /3jri = and 7 is the binary latent state matrix. 

The diagonal elements in (3 and 7 are not assigned values. Conditioning on 7 *^, (^ij is assumed 
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to follow a normal mixture distribution (George & MeCulloeh 1993, 1997): 

Aj I lij ~ (1 - 7ii)A'(0,rfo) + 7yA/'(0,r2), ^ 

where Tiolrn = 5 and 0 < <5 < 1. The prior on 7 *^ is Bernoulli: 

p{iij = 1) = 1 - piriij = 0) = g. 

5 , Til and q are prefixed hyperparameters and are diseussed in the Supplementary Materials. The 
off-diagonal entries in 7 represent the presenee or absenee of the edges, whieh is the goal of our 
inferenee. 

Let cr = ((Ji,..., (Tp), where af = The inverse gamma (IG) eonjugate prior is assumed 
for af: 

7 ~ IG{i/i/2,\i/il2). 

In this paper, we assume that z/j = 0 and the IG prior reduees to a flat prior (Li & Zhang 2010). 

For eaeh node, we perform the Bayesian proeedure to seleet the neighbors of that node. The 
preeision matrix 0 is symmetrie. If we let /3jr, = instead of —the symmetrie 

eonstraint ean be satisfied by foreing /3 to be symmetrie. However, this will lead to substantial 
loss in eomputational effleieney sinee the elements in /3 have to be updated one at a time, instead 
of one row at a time. Our simulation results suggest that the performanee of the two models is 
eomparable, whether or not the eonstraint on /3 is assumed (data not shown). Therefore, we do 
not eonstrain /3 in our inferenee. The seleeted neighborhood should be symmetrie (the support 
of 0 is symmetrie). Meinshausen & Biihlmann (2006) suggested using an or/and rule after their 
neighborhood seleetion proeedure for eaeh node. In our Bayesian proeedure, the symmetrie 
eonstrain ean be ineorporated naturally when sampling 7 by setting 7 ^ = 7 ^* for j ^ i. When 
there is no eonstraint assumed, the Bayesian proeedure ean be performed independently for eaeh 
node. 

2.2 Extension to mutiple graphs 

When there is similarity shared among multiple graphs, jointly estimating multiple graphs ean 
improve inferenee. We propose to jointly estimate multiple graphs by speeifying a Markov Ran¬ 
dom Field (MRF) prior on the latent states. Our model ean naturally ineorporate eomplex data 
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structures, such as spatial data, temporal data and data with both spatial and temporal structures. 
Consider jointly estimating multiple graphs for data with both spatial and temporal structures. 
Denote B the set of spatial loci and T the set of time points. Our proposed model can be natu¬ 
rally implemented when there is missing data, i.e. no data points taken in certain locus at certain 
time point. For now, we assume that there is no missing data. The latent states for the whole 
dataset are represented bya|i?| x |T| x p x p array 7 , where | | denotes the cardinality of a 
set. Let jbt- denote the latent state matrix for locus b at time t. In the real data example, 6 is a 
categorical variable representing the brain region and t is a discrete variable that represents the 
time period during brain development. Same as that in Section 2.1, the diagonal entries in 'jbt.. 
are not assigned values. 

Consider estimating the neighborhood of node i. Let jhaj denote the latent state for node j G 
ri in locus b at time t. Denote 7 ..^^ = { 7 ^^ :\/b e B^t e T}, E^j = {■.b^b',t = 
t'} and = {( 7 ;,**^, 76 /^/*^) b = b' and \t — t'\ = 1}. Here contain all the pairs capturing 
spatial similarity and contain all the pairs capturing temporal dependency between adjacent 
time periods. We do not consider the direction of the pairs: (7*^^, 'lijb't') and {%jb't': lijbt) are the 
same. Let /i(-) and /o(-) represent the indicator functions of 1 and 0, respectively. The prior for 
7 ..ij is specified by a pairwise interaction MRF model (Besag 1986, Lin et al. 2015): 


I oc exp< Pi E {.lijbt) 

I b&B,t&T 

Vs ^ ^ T ^l{'Vbtij)^l{'lh't'ij) 

E‘. 

Vt ^ ^ ^l{'Vbtij)^l{'Vb't'ij) 

and conditional independence is assumed: 


+ 


( 2 ) 


p(7 I $) = nn P{l-ij I ^), (3) 

i j&Ti 

where # = {viiVsiVt} are set to be the same for all i and j. 771 G M and when there is no 
interaction terms, 1/(1 + exp(—771)) corresponds to q in the Bernoulli prior. 77^ G M represents 
the magnitude of spatial similarity and 77^ G M represents the magnitude of temporal similarity. 
In the simulation and real data example, 771 is prefixed, whereas ps and pt are estimated from the 
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dataset. Discussion on the choice of rji is provided in the Supplementary Materials. The priors 
on r]s and r]t are assumed to follow uniform distribution in [0, 2]. 

Let 'y.-ijl'ybtij denote the subset of excluding '^buj- Then we have: 


Pilbtij I J-ij/lbUj, 


exp{jbtijF{'ybtij, ^)} 

1 + exp{F(7Mij,#)}’ 


( 4 ) 


where 


F{jbuj, =r]i + r]s ^ (2-fb'tij - 1 ) 

b'eB,b'^b 

+ rit{It^i[2'yb(t_i^ij — 1 ] + ItjLT[2'yb{t+i)ij — !]}• 

In (2), we made the following assumptions: a) the groups with different spatial labels are 
parallel to each other and they have the same magnitude of similarity and b) the time periods are 
evenly spaced and can be represented by integer labels. The first assumption can be relaxed by 
adjusting (2) in two ways: 1. let r]^ vary for different pairs of loci (e.g. let r]s be some parametric 
function of the pairwise distance); 2. adjust to incorporate complex structure for the spatial 
loci (e.g. sub-groups or some graph to describe the adjacency of spatial loci). For the second 
assumption, r]t can be adjusted to a parametric function of the time interval. When there is only 
spatial or only temporal structure in the dataset, model (2) can be adjusted by removing the 
summation over the corresponding pairs. 

3 Model selection 

For single graph and multiple graphs, the posterior probabilities are sensitive to the choices of 
hyperparameters. The ROCs are much more robust to the prior specification (Supplementary 
Materials). The posterior probability can be used as a score to rank the edges. One way of doing 
model selection is to sort the marginal posterior probabilities and select the top K edges, where 
K may depend on the prior knowledge of how sparse the graphs should be. An alternative way 
is to first perform thresholding on the marginal posterior probabilities to get an estimate of the 
graph structure Q, and the precision matrix 0 can be obtained by a fast iterative algorithm (Hastie 
et al. 2009). By varying the threshold, different 0s are obtained and some model selection criteria 
(for example, BIC) can be incorporated to select a model. 
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4 Theoretical Properties 


We rewrite p as pn to represent a sequence pn that changes with n. Let I < p* < Pn- Throughout, 
we assume that X satisfies the sparse Riesz condition (Zhang & Huang 2008) with rank p*; that 
is, there exist some constants 0 < Ci < C2 < oo such that 

. I|Xa«IP , 

Cl < — Trip- < C2, 

n M r 


for any A C {1 ,... with size |yl| = p* and any nonzero vector u G 

Consider estimating the neighborhood for the ith. node. We borrow some notations from 
Narisetty et al. (2014). For the simplicity of notation, let /3* = /3jri = 7* = 7*ri- 

Write Tio, th and q as ron, Tin and g„, respectively, to represent sequences that change with n. 
We use a {pn — 1) x 1 binary vector /c* to index an arbitrary model. The corresponding design 
matrix and parameter vector are denoted by X^i = (XpJ^i and /3^i, respectively. Let f represent 
the true neighborhood of node i. 

Denote by Amax(-) and Xmini') the largest and smallest eigenvalues of a matrix, respectively. 
For v > 0, define 


m{v) = niniv) 


{Pn 


1 )A 


n 

(2 + v) \og{pn 


1 ) 


and 




/X' XfcA 

mm Amin ( - ) • 

k'^\\k'^\<m(v) Tl J 


For iF > 0, let 




min 


ll(/-Pfc0Xt.AMlL 


where |/c* | denotes the size of the model /c* and P^i is the projection matrix onto the column space 
of Xfci. 


For sequences and 6„, a„ ~ hn means 0^/6^ —)■ c for some constant c > 0, -< 6^ 

(or bn y an) means an = o{bn), and an A bn (or bn 7 a^) means an = 0{bn)- We need the 
following conditions. 


(A) Pn ^ oo and pn = 0{n% for some 6^ > 0; 


(B) Qn = Pn ^ for some 0 <q;<1A(1/6*); 

(C) nTon = o(l) and ~ n V for some (5i > 1 + a; 


8 



(D) |f I ^ n/\ogpn and ||/3*i||| ^ logpn; 

(E) there exist 1 + a < 82 < 5 i and K > 1 + 8/(52 — 1 — a) such that, for some large C > 0, 

> C\t\ log(n V 


(F) p* > {K + l)\t% 

(G) Amax(X'X/n) -< {nTQ^)~^ /\{nTl^) and there exist some 0 < v < 62 and 0 < k < 2{K — 1) 

such that 

(n V 




nr^n 


ypn- 


Theorem 1 . Assume conditions (A)-(G). For some c > 0 and s > 1 we have, with probability 
at least 1 — cp~^, P{Y = I > ^ — ^n, where goes to 0 as the sample size increases 

to 00 . 

To establish graph-selection consistency, we need slightly stronger conditions than (D)-(G). 
Fet 

t* = max |t*|,A*(A')= min (Ai(A')/(T/) and = min 

l<i<Pn l<i<pn l<i<p„ 

(D’) t* -< n/\ogpn and maxi<i<p„ ||/3/||i -< r/„logpn; 

(E’) there exist 1 + a < ^2 < 5i and K > 1 + 8/(52 — 1 — a) such that, for some large G > 0, 
A*{K) > Glog(nVp2+25i). 


(F’) p* > {K + l)t*; 


(G’) Amax(X'X/n) -< (nrg^) ^A(nr/„) and there exist some 0 < v < S 2 and 0 < k < 2{K — 1) 
such that 

(n V 


Kn{v) A 


nr- 




In 


Fet Q denote the true graph structure and 7 is the latent state matrix for all the nodes. 
Theorem 2. Assume conditions (A)-(C) and (D’)-(G’)- We have, as n —)■ cx), P (7 = Q 


X,cr2) ^ 1. 
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5 Simulation examples 

5.1 Joint estimation of multiple graphs 

We first considered the simulation of three graphs. For all three graphs, p = 100 and n = 150. 
We first simulated the graph structure. We randomly selected 5% or 10% among all the possible 
edges and set them to be edges in graph 1. For graphs 2 and 3, we removed a portion (20% or 
100%) of edges that were present in graph 1 and added back the same number of edges that were 
not present in graph 1. 20% represents the case that there is moderate shared structure. 100% 
represents the extreme case that there is little shared structure other than those shared by chance. 
For the entries in the precision matrices, we considered two settings: a) the upper-diagonal entries 
were sampled from uniform [—0.4, —0.1] U [0.1, 0.4] independently and then set the matrix to be 
symmetric b) Same as that in a), except that for the shared edges, the corresponding entries were 
set to be the same. To make the precision matrix positive definite, we set the diagonal entry in a 
row to be the sum of absolute values of all the other entries in that row, plus 0.5. 
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(a) Sparsity~0.05, change=0.2, (b) Sparsity~0.1, change=0.2, (c) Sparsity~0.05, change=l, 
different entry values different entry values different entry values 





(d) Sparsity^O.l, change=l, 
different entry values 


(e) Sparsity~0.05, change=0.2, (f) Sparsity~0., change=0.2, 
same entry values same entry values 



(g) Sparsity~0.05, change=l, 
same entry values 



(h) Sparsity~0.1, change=l, 
same entry values 


Figure 1: Comparisons of different models for the estimation of three graphs. For the shared 
edges, the eorresponding entries in the preeision matriees take the same (“same entry values”) or 
different (“different entry values”) non-zero values. The x-axis was truneated to be slightly larger 
than the total number of true positive edges. The eurves represent the average of 100 independent 
runs. 
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The simulation results are presented in Figure 1. Our method (MRF) was compared with 
Guo’s method (Guo et al. 2011), JGL (Danaher et al. 2014) and graphical lasso (glasso) (Fried¬ 
man et al. 2008). In glasso, the graphs are estimated independently. In JGL, there are two options, 
fused lasso (JGL-Fused) and group lasso (JGL-Group). For Guo’s method, glasso and JGL, we 
varied the sparsity parameter to generate the curves. For our method, we varied the threshold 
for the marginal posterior probabilities of 7 to generate the curves. There are two tuning pa¬ 
rameters in JGL, Al and A 2 , where Ai controls sparsity and A 2 controls the strength of sharing. 
We performed a grid search for A 2 in {0, 0.05,..., 0.5} and selected the best curve. In Figure I, 
our method performed slightly better than Guo’s method. When there is little shared structure 
among graphs, our method performed slightly better than glasso, which is possibly due to the 
fact that we used a different modeling framework. When the entries were different for the shared 
edges, JGL-Fused did not perform well. However, when the entries were the same, JGL-Fused 
performed much better. The fused lasso penalty encourages entries in the precision matrix to be 
the same and JGL-Fused gains efficiency when the assumption is satisfied. 

5.2 Joint estimation of multiple graphs with temporal dependency 

In this setting, we assumed that the graph structure evolved over time by Hidden Markov Model 
(HMM). We set p = 50. At time t = 1, we randomly selected 10% among all the possible edges 
and set them to be edges. At time t -f 1, we removed 20% of the edges at time t and added back 
the same number of edges that were not present at time t. The entries in the precision matrix 
were set the same as that in a) in Section 5.1. We present the simulation results in Figure 2, 
varying n and |T|. We compared our method with Guo’s and JGL-Group, where the graphs were 
treated as parallel. Our method performed better than Guo’s method and JGL-Group in all three 
settings, and the difference was greater when either n or |T| increases. We did not include JGL- 
Fused in the comparison as the computational time for JGL-Fused increases substantially when 
the number of graphs is more than a few. 
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(a) n = 100, |r| = 10 (b) n = 200, |r| = 10 (c) n = 100, |r| = 30 

Figure 2: Comparisons of different models for the estimation of mutiple graphs with temporal 
dependeney. The x-axis was truneated to be slightly larger than the total number of true positive 
edges. The curves represent the average of 100 independent runs. 

5.3 Joint estimation of multiple graphs with both spatial and temporal de¬ 
pendency 

We simulated graphs in \B\ = 3 spatial loci and |T| = 10 time periods. We setp = 50, n = 100, 
and sparsity~ 0.1. We first set the graphs in different loci at the same time point to be the same. 
The graph structure evolved over time by HMM similarly as that in Section 5.2, and 40% of the 
edges changed between adjacent time points. For all graphs, we then added some perturbations 
by removing a portion (10%, 20%, 50%) of edges and adding back the same number of edges. For 
simplicity, we treated the spatial loci as parallel and did not simulate more complex structures. 
The entries in the precision matrix were set the same as that in a) in Section 5.1. The simulation 
results are presented in Figure 3. Our method achieved better performance than all the other 
methods. 

5.4 Computational time 

We evaluated the computational speed of the Bayesian variable selection procedure in the esti¬ 
mation of single GGM and multiple GGMs. For single GGM, we compared our method (B-NS) 
with Bayesian Graphical Lasso (B-GLASSO) (Wang et al. 2012) in Figure 4a. Our algorithm 


13 















(a) perturbation=0.1 


(b) perturbation=0.2 (c) perturbation=0.5 


Figure 3: Comparisons of different models for the estimation of mutiple graphs with temporal 
and spatial dependeney. The x-axis was truneated to be slightly larger than the total number of 
true positive edges.The eurves represent the average of 100 independent runs. 


took 0.5 and 4.5 minutes to generate 1,000 iterations for p = 100 and p = 200, and B-GLASSO 
took 1.6 and 17.9 minutes. We also evaluated the speed of our algorithm for the joint estimation 
of multiple graphs, where n and p were both fixed to 100. The CPU time was roughly linear 
as the number of graphs inereased (Figure 4b). All eomputations presented in Figure 4 were 
implemented on a dual-eore CPU 2.4 GHz laptop running OS X 10.9.5 using MATLAB 2014a. 
The computational cost of our algorithm is 0{p^). When p = 500, for single GGM, our algo¬ 
rithm took ~ 60 minutes for 1,000 iterations. In all the previous examples, we did not implement 
parallel computing. The computational time may be substantially reduced when the method is 
implemented in parallel by multicore processors (data not shown). Even for larger p {p >500), 
our method may still be applicable if parallel computing is implemented. 


6 Application to the human brain gene expression dataset 

Next we apply our method to the human brain gene expression microarray dataset (Kang et al. 
2011). In the dataset, the expression levels of 17, 568 genes were measured in 16 brain regions 
across 15 time periods. The time periods are not evenly spaced over time and each time period 
represents a distinct stage of brain development. Because of the small sample size, we incorpo¬ 
rated the MRF model as in equation (2) and did not consider more complex extensions: the brain 
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(a) Single Graph 


(b) Multiple Graphs 


Figure 4: CPU time for 1,000 iterations of sampling. The left plot shows the CPU time for single 
graph with inereasing node size, we eompared our method (B-NS) with Bayesian Graphical Lasso 
(B-GLASSO) (Wang et al. 2012). The right plot shows the CPU time with increasing number of 
graphs, where the node size was fixed to 100. 


regions were treated as parallel and the time periods were treated as discrete variables from 1 to 
15. We excluded the data from time periods 1 and 2 in our analysis because they represent very 
early stage of brain development, when most of the brain regions sampled in future time periods 
have not differentiated. We also excluded the data where the number of replicates is less than 
or equal to 2 (since a perfect line can be fitted with two data points), this step further removes 8 
groups of data: (brain region) “MD” (in time period) 4, “SIC” 5, “MIC” 5, “STR” 10, “SIC” 11, 
“MIC” 11, “STR” 11 and “MD” 11. The number of replicates varies across brain regions/time 
periods and the number is less than 10 in general, with a few exceptions. We studied the network 
of 7 genes. These 7 genes are high confidence genes that have been implicated in their roles 
for Autism Spectrum Disorders (ASD): GRIN2B, DYRKIA, ANK2, TBRl, POGZ, CUL3, and 
SCN2A(Willsey et al. 2013). ASD is a neurodevelopment disorder that affects the brain and have 
an early onset in childhood. With a good understanding on the networks of the 7 ASD genes 
across brain regions and time periods, we hope to gain insight into how these genes interact spa¬ 
tially and temporally to yield clues on their roles in autism etiology. The posterior mean and 
standard deviation for r]s were 0.56 and 0.51, respectively. The posterior mean and standard devi¬ 
ation for rjt were 0.95 and 0.63, respectively. The estimated model parameters suggest moderate 
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sharing of network structure across brain regions and time. 

MFC OFC VFC DFC 





Figure 5: The estimated graphs for all brain regions except “STR” in time period 10. Each graph 
corresponds to one brain region. Period 10 corresponds to early childhood (1 years < age < 6 
years), corresponding to the period of autism onset. 

For each graph, we selected the top 5 edges with the highest marginal posterior probabilities. 
Time period 10 corresponds to early childhood (1 years < age < 6 years), which is the typical 
period that patients show symptoms of autism. The graphs for all brain regions except “STR” 
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Time period 5 

GRIN2B 
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Figure 6: The estimated graphs for brain region “MFC” aeross time. 


(excluded data) are shown in Figure 5. Of particular interest are the genes that are connected 
with TBRl, which is a transcription factor that may directly regulate the expression of numerous 
other genes. The edge between TBRl and DYRKIA is mostly shared among the brain regions (7 
regions). DYRKIA is a protein kinase that may play a significant role in the signaling pathway 
regulating cell proliferation and may be involved in brain development (Di Vona et al. 2015). It 
may be interesting to check whether TBRl physically binds to DYRKIA during brain develop- 
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ment. The graphs for region “MFC” across time are demonstrated in Figure 6. Because of the 
limit of space, we only show the temporal dynamics for one brain region. There are moderate 
sharing of edges over time. Although the edge between TBRl and DYRKIA is not present in 
time period 10, it is present in time periods 5, 8, 12 and 13. Further biological experiments are 
required to validate whether the interaction between TBRl and DYRKIA changes over time or 
it is caused by the lack of power to identify true edges due to the small sample size. 

7 Conclusion 

In this paper, we proposed a Bayesian neighborhood selection procedure to estimate Gaussian 
Graphical Models. Incorporating the Markov Random Field prior, our method was extended 
to jointly estimating multiple GGMs in data with complex structures. Compared with the non- 
Bayesian methods, there is no tuning parameter controlling the degree of structure sharing in 
our model. Instead, the parameters that represent similarity between graphs are learnt adap¬ 
tively from the data. Simulation studies suggest that incorporating the complex data structure 
in the jointly modeling framework would benefit the estimation. We implemented our method 
by a fast and efficient algorithm that are several-fold faster than arguably the fastest algorithm 
for Bayesian inference of GGMs. Applying our method to the human brain gene expression 
data, we identified some interesting connections in the networks of autism genes during early 
childhood. We also demonstrated the graph selection consistency of our procedure for the esti¬ 
mation of single graph. The Matlab code is available at https : / /github . com/ linzxO 6 / 
Spatial-and-Temporal-GGM. 
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1 The posterior sampling procedure 

For the estimation of multiple graphs, the steps for updating (3 and cr are the same as that for 
single graph, and thus we only show the update for single graph. 

1.1 Updating of (3 

The rows in /3 ean be updated independently. Consider updating the ith row of /3. Let Di be a 
(p — 1) X (p — 1) dimensional diagonal matrix, and the jth diagonal entry takes the following 
value: 

The full eonditional distribution of follows multivariate Gaussian: 

Ar, I . ~ Ar((X^^Xr, + A)-'X'r^X„aAX'r^Xr, + A)-'). (D 

1.2 Updating of cr 

The elements in cr = (ui,..., ap) ean be updated independently. The full eonditional distribution 
of (Tj follows inverse Gamma distribution: 

a, I . ~ IGiiu, + n)/2, (Xu, + (X, - Xr,ArJ'(X. - Xr,ArJ)/2). 
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As Vi = 0, 


I . ~ /G(n/2, (Xj - Xr./3'r,)'(X. - Xr./3;r.)/2). 

1.3 Updating of 7 

For single graph, the full conditional probability of can be shown to be: 

Pilij I •) oc I 7 y)p( 7 ij), for j ^ i 

If there is symmetric constraint, the upper diagonal elements of 7 can be updated as: 

Vilij I •) oc p{l3ij I %j)p{/3ji I 7ji = 77 )p( 77), for J > *• 

Then let 7 ^* = 7 ^. 

For multiple graphs, the full conditional probability of 'yuij can be shown to be: 

Pilbtij I •) oc p{l3btij I lbtij)p{lbtij I 1-ijhbtij, ^), for j ^ i. 

If there is symmetric constraint, the full conditional distribution will be: 

Pi^'lbtij I •) OC p(^( 3 btij I 'Jbtij'jpi^Pbtji \ '^btji '^btij^Pi^'^btij \ ■■ij j^btiji for J 7 %. 

Then let 'yttij- 


1.4 Updating of $ 

$ = {pi-,Ps-,Pt}- Vi is prefixed. Then we have: 

p{vs,vt I •) (xp{ps,vt)ph I =p{vs)p{vt) nn ph-ij I ^), 

i jeFi 

where p{ps) and p{pt) represent the uniform priors on ps and pf The normalizing constant in 
p{'y..ij I $) is generally not tractable as one has to sum over all possible configurations 

of 'y.-ij. We approximate p( 7 .y | $) by pseudolikelihood (Besag 1986): 

ph-ij I ~ nn Piyibtij I 'y-'ij/'ybtiji^')- 

beB teT 

The Metropolis-Hastings (MH) algorithm was implemented to update ps and pt. 
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1.5 Computation 

When p is large, the most eomputationally intensive step is updating /3, whieh involves the inver¬ 
sion of (p — 1) X (p — 1) matriees. The explieit matrix inversion ean be avoided by first performing 
Cholesky deeomposition. Consider updating f3i^^ by sampling from a multivariate Gaussian dis¬ 
tribution with mean (Xp^Xr^ -f Di)“^Xp.Xj and eovarianee matrix (Tj^(Xp.Xri + By 

Cholesky deeomposition, Xp.Xp. + Di = R'R, where -R is a upper-triangular matrix. The eom- 
putational eost for Cholesky deeomposition is (p — 1)^/3 and it is faster and more stable than 
direet matrix inversion. Next we sample Z from A/’(0, 1) and (Xp.Xr^ + D,i) ^Xp^Xj-f (Tii? 
follows the desired distribution of 

(X'p^Xp^ + A)"'X'p^X, + aiR-^Z = i?-^((i?-i)'X^ Xi + a^Z). 

(R“^)'Xp^Xj and subsequently i?“^((i?“^)'Xp^Xj + aiZ) can be solved by forward and back¬ 
ward substitution and the computational cost is 0((p — 1)^). 

Note that our model enables parallel computing in two levels: 1. for the estimation of a single 
graph, the rows in /3 can be updated in parallel independently; 2. for the joint estimation of multi¬ 
ple graphs, the matrix (3 for each graph can be updated in parallel independently. When multicore 
processors are available, parallel computing will result in substantial gain in computational speed. 

2 Choosing the hyperparameters 

2.1 Choosing ti and 6 

We performed diagnosis for ri and 5 in single graph simulations. We simulated random graphs 
with sparsity~ 0.05 or 0.1, p = 100, and n = 50 or 150. Let Si be the standard deviation of Xj 
and let Th = Isi. We fixed q to the corresponding sparsity level and varied I and 6 (Figure 1). 
When n = 150, the performance did not vary much when / and 5 varied. We set / = 0.1 and 
5 = 0.1 in the simulation of multiple graphs. When n < p, numerical issues can arise as the 
entries in will be close to 0 and XpXr- -f Di will become singular. The numerical issue can 
be avoided by decreasing g or /. For the real data example, we set / = O.ln/p if n < p. 
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Figure 1: Performance of the model for different I and 6 (single graph). The curves represent the 
average of 100 independent runs. 
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Figure 2: Performance of the model for different q (single graph). The curves represent the 
average of 100 independent runs. 


2.2 Choosing the Bernoulli parameter q 

We simulated random graphs with sparsity~ 0.05 or 0.1, p = 100, and n = 50 or 150. We 
fixed I = 0.1, (5 = 0.1 and varied q. The graph selection consistency requires q = p~^. In our 
simulation examples, the posterior probabilities are sensitive to q, but the ROC is quite robust to 
the choice of q (Figure 2). 
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(a) Sparsity~ 0.05 (b) Sparsity^ 0.1 


Figure 3: Performance of the model for different rji. The curves represent the average of 100 
independent runs. 

2.3 Choosing 771 in the MRF prior 

In the simulation studies, the performance of our model was quite sensitive to the choice of r]i. 
This may be caused by the fact that we used pseudolikelihood for approximation. We simulated 
three random graphs same as that in Section 5.1 in the main text, where p = 100, n = 150 and 
change= 0.2. We found that when rji = —0.5, our proposed model has reasonable performance 
(Figure 3). Therefore we fix rji = —0.5 for both the simulation studies and the real data example. 


3 Proofs 


Let F* = Xj and F* = Xy\|j}. Since the proof is independent of the node index, in the sequel 
we suppress the script i. We also suppress v and K from the notation of m{v), \m{v), and A(iL). 
Throughout, we use c', s' and w' to denote generic positive constants that can take different values 
each time they appear. 

Define the Bayes factor of model k with respect to the true model t as 


BF{k,t) 


P{y = k\Y,Z,a^) 
P{-f = t\Y,Z,a^)- 


Let Dk = dmg{T{^k + - k)} and Rk = Y'{I - Z{Dk + Z'Z)-^Z'}Y. 
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Lemma 1. Under the conditions of Theorem 1, for any k t, we have 


BF{k,t) < wfnrlXUl - exp{-(/2fc - Rt)/{2a^)}, 

where = rank{Z]fj,rl = Am, and = o(l) uniformly in k. 

Proof of Lemma 1. This is Lemma 4.1 of Narisetty et al. (2014), and we omit the proof. 

Let Rk = r{/ - + Z',Zk)-^Z',}Y and Rl = Y'{I - Pu)Y. 

Lemma 2. Assume the conditions of Theorem 1. Let j and k be models such that j C k. We have 
{Rj - Rk){l - M;n)(l - Cn)' < Rj -Rk< {Rj - Rk){l - 

where Wn = o(l) uniformly in k, and = nTQ^\ma.x{Z'Z/n) = o(l). 

Proof of Lemma 2. This is Lemma A.l (iii) of Narisetty et al. (2014), and we omit the proof. 

Lemma 3. Assume the conditions of Theorem 1. For any e > 0 and any sequence Qn such that 
logpn Y Qn, we have 

P{\R*t — na‘^\ > ena^) < exp(—c n) 

and 

P{Rt - R* > gn) < exp{-cnT^^gn). 

Proof of Lemma 3. This is Lemma A. 2 of Narisetty et al. (2014), and we omit the proof. 
Proof of Theorem 1. Following Narisetty et al. (2014), we divide the set of possible incorrect 
models into four subsets 

Ml = {k ■. Tk > m}, 

M 2 = {k : k ^ t, k t,rk < m}, 

M 3 = {k : k ^ t, K\t\ < Tk < m}, 

and 

M 4 = {k : k ^ t,rk < A'|t|}. 

We shall prove BF{k, t) 0 for each m = 1, 2, 3, and 4. 

Unrealistically large models. We first consider the models in Mi, which correspond to 
all the models containing at least m linearly independent covariates. Note that Mi is empty if 


7 



Pn < n/ \og{p‘^~^'^). Thus, we assume for the moment that pn > n/ \og{pf^^''). Then we have 
rl = m = n/ log(p^+^) for all k G Mi. For any s > 0, 

P[UkGMi{Rt - Rk> {I + 2s)a^n}] 

< P{Rt > (1 + 2 s)a‘^n} 

< P{Rt > (1 + 2 s)a‘^n} 

< P{R* > (1 + s)(j‘^n} + P{Rt — Rl > s(j‘^n). 

By Lemma 3, 

P[UfceAri{-Ri - Pfc > (1 + 2s)(T^n}] < 2exp(-c'n). 

Restrieting to the event {Rt — Rk < (1 + 2s)a'^n,Wk E Mi}, Lemma 1 and Condition (G) 
give 

BF{k, t) P {nrl^Xmil - exp{(l + 2s)n/2} 

k&Mi 

fceMi 

Note that m = n/ log(p^+'^), |t| = o(m), and v < 62 - Then 

Z BF(k,t) 

k&Mi 

P Y exp{-n(l + 62 )/i 2 + <52)}g|f'“'‘'Aj*l/'exp{(l + 2 s)n/ 2 } 

fceMi 

= y; exp{-n(l + S,)/(2 + S,) + (1 + 2s)n/2},l*l-l‘lA-l‘l''l 

k&Mi 

By Conditions (B) and (G), we have 

BF{k,t) 

k&Mi 

r< exp{-n(l + AJ/(2 + + (1 + 2s)n/2}p;'l‘l p'*' 

fceAfi 

Pn / \ 

P exp{-n(l + (52)/(2 + 52 ) + (1 + 2s)n/2}p^'l*l Y 

|fc|=m+l 1/ 

P exp{-n(l + 82)/{2 + 82 ) + (1 + 2s)n/2}p^'l*l(l + 
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By Conditions (A) and (B), pn log(l+g„) ^ Pn = o{n), and by Condition (D), |t| \og{pn) = o(n). 
Hence 

BF{k,t) A exp(—w'n) 

keMi 

for some w' > 0, if s satisfies 2s < 62/(2 + 62 ). Therefore, with probability at least 1 — 
2 exp(—c'n), 

J 2 BF{k,t)^ 0 . (2) 

fceMi 

Over-fitted models. For k e M 2 , we have 

r;-ri = r(Pt - Pt)Y = |i(p» - p,)Yf, = ii(fi - PM'i = - f.)£. 

where e ~ A^(0, a^J). Since e'(Pk — Pt)e/a‘^ follows the chi-squared distribution with 
degrees of freedom, for any s > 0 and a > 1, we have 

P{R* - Rl> 0-^(2 + 2sa)(rk - n) logpn} 

= > (2 + 2sa)(rfc-rt)logp„| 

= F I > 1 + (2 + 2 sa) log Pn — 1 

[r-k-rt 

< exp{-(rfc - rt)(l + wsa) logp„} 

for some 1/a < te < 1. Hence 

P{R; -Rl> a\2 + 2sa)(rfc - n) logp4 < (3) 

Consider s > 1 and a > 1. Define the event 

A(k) = {Rt - Rk> 24(1 + sa)(rk - n) logpn}. 

By Lemma 2, we have, for some 1 < a' < a, 

A(k) C {Fj - Ffc > 24(1 + sa)(rfc - rt)(l-4) logp„} 

C {Ff - Ffc > 24(1 + sa^’^fc - n) log Pn} 

C {Ff - Rl> 24(1 + sa)(rk - n) logpn}- 
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For a fixed dimension d > rt, consider the event U{d) = U{fcgM 2 :rfc=d}^(^)- Then, for any 
1 < w' < a', 


P{U{d)} 

< Pp{k(iM 2 -.rk=d}{Rt - R*k> 2cr^(l + sa'){d - n) logp^}] 

< P[U^k(iM 2 -.rk=d}{R*t - Rl> ^^(2 + 2 sw'){d - rt) logPn}] (4) 

+P[Rt - R* > cr^(2sa' - 2 sw'){d - rt) logp„]. 

The event {i?* —Rl > a'^{2+2sw') (rk—rt) logp^} depends only on the projection matrix Pk—Pt- 
When rt < r^ = d < m, the cardinality of such projections is at most (p„ — 

This, together with (3), (4) and Lemma 3, yields 

P{U{d)} < +exp(-c'nri„logpn) (5) 


Hence 


p <Y.p{u(d)} < < c’p-'. 

d>rt d>rt 

Restricting to the event nd>rt{U{d)Y, by Lemma 1 and Condition (G), we have 


BF{k,t) P {nr^^Xmil - (l>n)} 

keM2 fceM2 


-{rk-rt)/2 \k\-\t\ (l+sa){rk-rt) 
'dn yn 


p 


E 

fceM 2 


l+sa \ rk n 
Pn W 


P^^^ V 


„l+sa „ \ Pn 

r< 11+ V ' -1- 

Pn V ^/n 


Since 82 > I + a, there always exist s > 1 and a > 1 such that 


Pi 


log (1 I 1 - ■ - 


p. 


l+sa+Q: 


V ph~^^^ y y/n J ph~^^^ y y/n ph~^^^ y y/n ^ ^ 


Therefore, with probability at least 1 — for some s > 1, we have 


BF{k, t)—)-0. 

k&M 2 


( 6 ) 
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Large models. Models in M 3 do not contain one or more active covariates with dimension at 
least K\t\ + 1. Since > -Dfcvo we have 

Rk Rk\Jt 

= Y'{I - Z{Dk + Z'Z)-^Z'}Y - Y'{I - Z{Dkyt + Z'Z)-^Z'}Y 
= Y'{Z{Dkyt + Z'Zy^Z' - Z{Du + Z'Z)-^Z'}Y > 0. 


Similar to the proof in the previous part, we define the event 

B{k) = > 2(t^( 1 + 4s)(rfc - rt) logpnj 

C > 2a^(l + 4s)(rfc - r^) logp^j 

C > 2(J^(1 + 2s)(rfc - rt) logpn}, 


and consider the union of such events V(d) = UfkGM 3 :rk=d}R(k). For s > 0, we have 


R{V(d)} < RlUikeM3:rk=d}{Rt - Rkvt > 2a^(l + 2 s){d - n) logp^}] 

< P[U{keM3:rk=d}{Rt - Rkwt > 0-^(2 + 3s)(d - rt) logp^}] 
+ P{Rt - R*t > cT‘^s{d - rt) logpn} 

< P[U{k&M3:rk=d}{Rt - Rkyt > a^(2 + 3s) (d - rt) logpn}] 
+ exp(-c'|t|nri„logpn). 


Further, 

P{P*t - Pkyt > 0-^(2 + 3s)(rfc - rt) logpn} 

< P{P*t - Plyt > 0-^(2 + 3s)(rfc - rt) logpn} 

_ pi ^ (2 + 3s)(rfc - n) logpn 

\\kAP\ \kAP\ 

pi X^kAt-\ {2 + 3s){K-1) log Pn 

K 

< exp{-|/c A + s) logpn} 

< exp{-(rfc - rt)(l + s) logp„}. 



Similar to (5), 

P{V{d)} < p-i^+^)id-rt)+d exp{-c'\t\nT^Jogpn) < 
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By Condition (E), there exists some c > 1 sueh that 

4(c+l) 


K-l> 


(52 - 1 - a) 

Set s = {c + 1){K — 1)”^ Then As < 62 — I — a and 

/, ^d — Tt _ .K — 1 ^ c 

Henee, for some c > 1, 

Pp{d>K\t\}V{d)] < cp~'". 

Restrieting to the event r\{d>K\t\}{y{d)Y, by Lemma 1 and Condition (G), we have 

Z BF(k,i) 


keMs 








k&Ms 


E 

keMs 


„l+4s „ \ rk-rt 

Pri dn \ . _|t|/2 

1+52 ^ ! rn 


V y/n 


Note that — rt > {K — l)|t| for k G M 3 . By Conditions (B) and (G), 


J2 BF(k,t) F j2 ( ,L:.. ^ 1 pf 


k&Ms 




k£M3 


E 

k£M3 


plY^^ V Yn 


1+As+a \ Pk n 
r'n 


plY^^ V 


< 1 + 


p. 


l+4s+a \ Pn 


V y/n 


- 1 . 


It is easy to show that 


(^1+45+0: \ ^1+45+0: 

1 + -ra+7^ ~ P. lE', ^ 2 = “(i)- 

Pn V YnJ Pn V Yn 

Therefore, with probability at least 1 — c'p~'^ for some c > 1, 

BF{k,t)^0. 

k&Ms 

Under-fitted models. Note that 


Rl-RY, = Y'{P,^,-P,)Y 

= \\{Pkvt-Pk)Y\\l 


> {IKPfcvi - Pk)ZAh - WiPkvt - Pk)ehV. 


(7) 
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By definition, ||(Pfcvt - Pfc)^tAI |2 = 11(1 - Pk)Ztl3t\\2 > VA for all k e M 4 . We have, for any 

w' e ( 0 , 1 ), 


< 

< 

< 

< 


ppk&MARl - Rlvt < (1 - 

P[UfceAr4{||(-Pfcvt — Pk )^\\2 > 


P{\\Pteh>w'^) 

(|«| ^ 

exp(-c'A). 


This implies that for w e (0,1), 

P [^k&M4,{Rk — Rkyt < A (1 -w)}] 

< P [UfceAr 4 {-^fc ~ Rkyt < A (1 — w)}] 

< ppu^MARl-Rlyt<H^-^im 

+ P ^k&Mi{Rkyt ~ Rkyt > A'U7/2}] . 

< exp(-c'A) + P [UkeMARkyt - Rlvt > Aw/2}] . 

Let Zy-yi Unx\kyt\R\kyt\x\kyt'^\j^\/f\x\kyt\ SVD of Zf^yt. Then 

Rkyt-Rlyt = Y'{I-Z,yt{T{^^I + Zl^,Z,y,)-^Z',^,}Y-Y'{I-P,yt)Y 
= Y'{P,yt - Z^ytin^I + Zl^,Z,yt)-^Zl^,}Y 
= Y'{UU' - UKV'ir^^VV' + KAV')“VAP'}y 
= Y'U{I - + K^)-^K}U'Y 

= Y'U{I + Tl^k^)-^U'Y 
< {s'nTl^y^Y'UU'Y. 


Since Unx\kyt\ is an unitary matrix with rank at most {K + 1) \t\, by Condition (D), we have 


P{Rkyt - Rlyt > Aw/2 ) 


= P{Y'UU'Y > s'nr/„Aw/2) 
< P(e'Pfcvte > w'nr/„A) 
p ( ^IfcVil ^ 4 I 

l^lfcvtr (y^\kyt\ 

P exp(-c'nr/„A), 
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and hence, 


P [UfcgM4 {-RfcVi ^kyt >Aw/2}] A p^l‘l+‘exp(-cWJ„A) 

A exp(-s'A). 


Therefore, 


P[VJk(iMi{Rk - Rkvt < A(1 - w)}] < 2exp(-c'A). 

This, together with Lemmas 2 and 3, for 0 < c = 3w < 1, we have 
-P[UfceM 4 {-Rfc — Rt < A(1 — c)}] 

< P[^keM4{Rk — Rkyt ^ A(1 — 2w)}] + P[UkeM 4 {Rk'vt — Rt < —Aw}] 

< P[Uk£M4{Rk — RkVt < A(1 - w)}] + P[UkeM4{Rt - Rkvt > w^'A}] 

< 2exp(-c'A) + P[UkeM 4 {Rt - R*kvt > w^A}] 

< 2exp(-c'A) + P[[Jk(^M 4 {Rt - Rkvt > w^A/2}] + P{Rt - Rl > w^A/2) 

< 3exp(-w'A) + P[\JkeM 4 {R*t - Rlvt > w‘^A/2}]. 


It is easy to show that 


P{R: - Rl^, > w^A/2) 


= P 


. P I ^^A/2 \ 

\kM-\ a^\kM-\ ) 


P exp(—c'A). 


By Condition (E), 


P [\Jk^M 4 {R*t - Rlvt > w^A/2}] 


P pfl‘l+^exp(-c'A) 
P exp(—w'A). 


We have, for some c > 1, 

P[VJk&M 4 {Rk - Rt< A (1 - c)}] < 4exp(-c'A) < 
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Restricting to the event {Rk — Rt > A(1 — c),\/k e M 4 }, by Condition (E), we have 

keMi 

A Y1 exp{-A(l - c)/2(t2} 

k&Mi 

^ Y, exp{-A(l - c)/2<r^} 

fceM4 

- V n)] g-l*l 

A:eM4 

“ -^{^(1 - c) - log(p^+^^'Vn) - 2a^(it: +1-a)|t|logp„} 

^ 0 . 


Therefore, with probability at least 1 — for some c > 1, we have 

Y BF{kR)^Q. ( 8 ) 

keMi 

Combining (2), ( 6 ), (7), and ( 8 ), the proof is complete. 

Proof of Theorem 2. Note that 


P(7 = e|X,a-") > P|n£,{7‘ = f} |X,a-"| 

= 1-P|uf;i{7' = *‘r |X,0-"| 

Pn 

> l-5;P|{7' = fr|X,(r?|. 


Under (D’)-(G’), c, s and in Theorem 1 can be chosen to be independent of the node index. 
Hence, as n —)■ cx). 


P{l = Q 


1 — cY^ - > 1 — c- 


The proof is complete. 
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